Bank Marketing Dataset

In this article, we work on a dataset available from the UCI Machine Learning Repository. The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

This dataset is based on the Bank Marketing dataset from the UC Irvine Machine Learning Repository. The data is enriched by the addition of five new social and economic features/attributes (nationwide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in [Moro et al., 2014] (it does not include all attributes due to privacy concerns).

Dataset Information:

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:

  1. bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
  2. bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
  3. bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
  4. bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with fewer inputs).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Loading the Dataset

The zip file includes two datasets:

  1. bank-additional-full.csv with all examples, ordered by date (from May 2008 to November 2010).
  2. bank-additional.csv with 10% of the examples (4119), randomly selected from bank-additional-full.csv. The smaller dataset is provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The binary classification goal is to predict if the client will subscribe to a bank term deposit (variable y).

| Number of Instances | Number of Attributes |
| --- | --- |
| 41188 | 21 |
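
A minimal sketch of the loading step with pandas is given here. It assumes the zip file has already been extracted and relies on the fact that the UCI CSV files use a semicolon as the field separator. The helper name `load_bank_data` and the inline two-row sample are illustrative assumptions, so that the snippet runs without the downloaded file.

```python
import io

import pandas as pd


def load_bank_data(source):
    """Read a Bank Marketing CSV from a path or a file-like object.

    The UCI files are semicolon-delimited, so sep must be set explicitly.
    """
    return pd.read_csv(source, sep=";")


# Tiny inline sample (hypothetical rows, real column names) used in place
# of the downloaded file so that the sketch is self-contained.
sample = io.StringIO(
    'age;job;marital;housing;y\n'
    '56;"housemaid";"married";"no";"no"\n'
    '37;"services";"married";"yes";"no"\n'
)
df_sample = load_bank_data(sample)
print(df_sample.shape)
print(df_sample.dtypes)
```

With the real file, `load_bank_data("bank-additional-full.csv")` (the path is an assumption) should return the 41188 rows and 21 columns listed above, that is, 20 inputs plus the target y.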

Bank Client Data

| Feature | Description |
| --- | --- |
| Age | numeric |
| Job | Type of Job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown") |
| Marital | marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed) |
| Education | (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown") |
| Default | has credit in default? (categorical: "no","yes","unknown") |
| Housing | has housing loan? (categorical: "no","yes","unknown") |
| Loan | has personal loan? (categorical: "no","yes","unknown") |

| Feature | Description |
| --- | --- |
| Contact | contact communication type (categorical: "cellular","telephone") |
| Month | last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec") |
| Day of week | last contact day of the week (categorical: "mon","tue","wed","thu","fri") |
| Duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. |
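
The note on Duration can be acted on with a one-line column drop. The sketch below is only an illustration: the helper name `drop_leaky_columns` and the toy frame are assumptions, while `duration` is the column name used in the UCI CSV files.

```python
import pandas as pd


def drop_leaky_columns(df, leaky=("duration",)):
    """Return a copy of df without columns that are unknown before the call."""
    present = [col for col in leaky if col in df.columns]
    return df.drop(columns=present)


# Toy frame standing in for the real data.
demo = pd.DataFrame(
    {"age": [30, 45], "duration": [0, 310], "y": ["no", "yes"]}
)
realistic = drop_leaky_columns(demo)
print(list(realistic.columns))
```

Keeping the original frame untouched makes it easy to train a benchmark model with `duration` and a realistic model without it.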

Other Attributes

| Feature | Description |
| --- | --- |
| Campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| Pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
| Previous | number of contacts performed before this campaign and for this client (numeric) |
| Poutcome | outcome of the previous marketing campaign (categorical: "failure","nonexistent","success") |

Social and Economic Context Attributes

| Feature | Description |
| --- | --- |
| Employment Variation Rate | employment variation rate - quarterly indicator (numeric) |
| Consumer Price Index | consumer price index - monthly indicator (numeric) |
| Consumer Confidence Index | consumer confidence index - monthly indicator (numeric) |
| Euribor three Month Rate | euribor* 3 month rate - daily indicator (numeric) |
| Number of Employees | number of employees - quarterly indicator (numeric) |

* the basic rate of interest used in lending between banks on the European Union interbank market and also used as a reference for setting the interest rate on other loans.

Output variable (Desired Target):

| Feature | Description |
| --- | --- |
| Term Deposit Subscription | has the client subscribed to a term deposit? (binary: "yes","no") |

Preparing the Dataset

Categorical Variables

Yes/No Features

First, let's convert all Yes/No columns as follows:

$$\begin{cases} -1 & \mbox{Unknown}\\0 &\mbox{No}\\ 1 &\mbox{Yes}\end{cases}$$
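
The mapping above could be applied with pandas roughly as follows. This is a sketch under assumptions: the helper name `encode_yes_no` is hypothetical, and the column names `default`, `housing`, `loan`, and `y` are the raw names from the UCI CSV files rather than the display names used in the tables.

```python
import pandas as pd

# Codes taken from the mapping above.
YES_NO_CODES = {"unknown": -1, "no": 0, "yes": 1}


def encode_yes_no(df, columns):
    """Return a copy of df with the given yes/no/unknown columns encoded."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(YES_NO_CODES)
    return out


# Toy rows standing in for the real data.
demo = pd.DataFrame(
    {
        "default": ["no", "unknown", "no"],
        "housing": ["yes", "no", "unknown"],
        "loan": ["no", "yes", "no"],
        "y": ["no", "yes", "no"],
    }
)
encoded = encode_yes_no(demo, ["default", "housing", "loan", "y"])
print(encoded)
```

Any value outside the three listed categories would become NaN under `map`, which makes unexpected labels easy to spot.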

Poutcome

For this feature, we have

$$\mbox{Poutcome} = \begin{cases} -1 & \mbox{Nonexistent}\\0 &\mbox{Failure}\\ 1 &\mbox{Success}\end{cases}$$

Marital

$$\mbox{Marital} = \begin{cases} -1 & \mbox{Unknown}\\0 &\mbox{Single}\\ 1 &\mbox{Married}\\ 2 &\mbox{Divorced}\end{cases}$$

Day Of Week

$$\mbox{Day Of Week} = \begin{cases} 0 & \mbox{Monday}\\ 1 &\mbox{Tuesday}\\ 2 &\mbox{Wednesday}\\ 3 &\mbox{Thursday}\\ 4 &\mbox{Friday}\\ 5 &\mbox{Saturday}\\ 6 &\mbox{Sunday} \end{cases}$$

Contact

$$\mbox{Contact} = \begin{cases} 0 & \mbox{Telephone}\\ 1 &\mbox{Cellular} \end{cases}$$
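
The encodings for Poutcome, Marital, Day Of Week, and Contact can be collected in one dictionary of mappings. The sketch below assumes the raw UCI column names (`poutcome`, `marital`, `day_of_week`, `contact`) and lowercase category labels as they appear in the CSV files; the helper name `encode_categories` is hypothetical.

```python
import pandas as pd

# One mapping per column, following the case definitions above.
CATEGORY_CODES = {
    "poutcome": {"nonexistent": -1, "failure": 0, "success": 1},
    "marital": {"unknown": -1, "single": 0, "married": 1, "divorced": 2},
    "day_of_week": {
        "mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4, "sat": 5, "sun": 6,
    },
    "contact": {"telephone": 0, "cellular": 1},
}


def encode_categories(df, codes=CATEGORY_CODES):
    """Return a copy of df with every listed column replaced by its codes."""
    out = df.copy()
    for col, mapping in codes.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out


# Toy rows standing in for the real data.
demo = pd.DataFrame(
    {
        "poutcome": ["nonexistent", "success"],
        "marital": ["married", "divorced"],
        "day_of_week": ["mon", "fri"],
        "contact": ["cellular", "telephone"],
        "age": [40, 50],
    }
)
encoded = encode_categories(demo)
print(encoded)
```

Columns that are not listed in the dictionary, such as `age` here, pass through unchanged.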

Job

Month

Education

Pdays

Age Group and Age Category

Creating new features:

We can create Age Categories using statcan.gc.ca.

| Interval | Age Category | Age Category Code |
| --- | --- | --- |
| 00-14 years | Children | 0 |
| 15-24 years | Youth | 1 |
| 25-64 years | Adults | 2 |
| 65 years and over | Seniors | 3 |
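
One way to derive these categories is `pandas.cut` with left-closed bins. The following is a sketch, not necessarily the original implementation; the output column names `age_category` and `age_category_code` and the helper name are assumptions.

```python
import pandas as pd

# Left-closed bins: [0, 15), [15, 25), [25, 65), [65, inf).
AGE_BINS = [0, 15, 25, 65, float("inf")]
AGE_LABELS = ["Children", "Youth", "Adults", "Seniors"]


def add_age_category(df, age_col="age"):
    """Return a copy of df with an age category label and its integer code."""
    out = df.copy()
    cats = pd.cut(out[age_col], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    out["age_category"] = cats.astype(str)
    out["age_category_code"] = cats.cat.codes
    return out


# Ages chosen to sit on both sides of every boundary.
demo = pd.DataFrame({"age": [14, 15, 24, 25, 64, 65, 98]})
result = add_age_category(demo)
print(result)
```

Setting `right=False` matters here: with the default right-closed bins, an age of 15 would fall into the Children interval instead of Youth.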

Data Correlations

Let's take a look at the variance of the features.

Furthermore, we would like to standardize features by removing the mean and scaling to unit variance. In this article, we demonstrated the benefits of scaling data using StandardScaler().
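
As a small illustration of what StandardScaler() does, the sketch below scales a toy matrix whose two columns have very different variances. The numbers are made up for the example and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two columns on very different scales.
X = np.array(
    [[25.0, 1.0], [35.0, 3.0], [45.0, 5.0], [55.0, 7.0]]
)
print("variance before scaling:", X.var(axis=0))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

print("mean after scaling:", X_scaled.mean(axis=0))
print("variance after scaling:", X_scaled.var(axis=0))
```

When a separate test set is used, a common practice is to call `fit` on the training data only and then `transform` both sets, so that no information from the test set leaks into the scaling parameters.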

Saving to a CSV

Train and Test Sets

First, consider the data distribution for Term Deposit Subscription.

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Therefore, we divide the dataset into train and test sets using stratification, which preserves the class distribution in both sets.
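
A stratified split can be sketched with scikit-learn as follows. This example uses `train_test_split` with `stratify=y` as a simple stand-in, which is an assumption rather than the exact call used in this article; the synthetic data, the 25% test size, and the random seed are also assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 12% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 24 + [0] * 176)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The positive rate should be (almost) identical in both parts.
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```

Without `stratify=y`, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer positive examples than the training set.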

Modeling: PyTorch Multi-layer Perceptron (MLP) for Binary classification

A multi-layer perceptron (MLP) is a class of feedforward artificial neural network (ANN). At each iteration, the algorithm measures the error with the cross-entropy loss, computes the gradient of that loss with respect to the model parameters, and updates the parameters accordingly. By the end of this iterative process, the predictions agree more closely with the true labels, since the loss is lower than it was at the first step.
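
The training loop described above can be sketched in PyTorch as follows. This is a minimal illustration on synthetic data, not the model used for the results in this article: the layer sizes, the Adam optimizer, the learning rate, and the number of epochs are all assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for the scaled feature matrix and binary target.
X = torch.randn(256, 8)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()  # class labels 0 or 1

# A small MLP: one hidden layer, two output logits (one per class).
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(200):
    optimizer.zero_grad()      # clear gradients from the previous step
    logits = model(X)          # forward pass
    loss = loss_fn(logits, y)  # cross-entropy between logits and labels
    loss.backward()            # compute gradients
    optimizer.step()           # update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

`nn.CrossEntropyLoss` expects raw logits and integer class labels, so no softmax layer is added at the end of the network.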

Setting up Tensor Arrays

Modeling

Model Optimization Plot

Confusion Matrix, ROC, ...

The confusion matrix allows for visualization of the performance of an algorithm.

In the formulas below, $T_p$, $T_n$, $F_p$, and $F_n$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

\begin{align} \text{Precision} &= \frac{T_p}{T_p+F_p},\\ \text{Recall} &= \frac{T_p}{T_p + F_n} \end{align}

Accuracy, however, can be a misleading metric for imbalanced datasets. Here, nearly 89 percent of the samples are negative (No) and only about 11 percent are positive (Yes). In such cases, the balanced accuracy (bACC) [4] is recommended. It normalizes the true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two:

\begin{align} \text{TPR} &= \frac{T_p}{T_p + F_n},\\ \text{TNR} &= \frac{T_n}{T_n + F_p},\\ \text{Balanced Accuracy (bACC)} &= \frac{\text{TPR}+\text{TNR}}{2} = \frac{1}{2}\left(\frac{T_p}{T_p + F_n} + \frac{T_n}{T_n + F_p}\right) \end{align}
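
To make the difference between accuracy and balanced accuracy concrete, here is a small pure-Python sketch. It uses the standard definitions, with the true negative rate computed as $T_n / (T_n + F_p)$; the function names and the ten toy labels are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    tpr = tp / (tp + fn)  # true positive rate (recall)
    tnr = tn / (tn + fp)  # true negative rate
    return (tpr + tnr) / 2


# Imbalanced toy example: 2 positives, 8 negatives.
# Assumes both classes occur and at least one positive is predicted,
# otherwise the divisions above would fail.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print("accuracy:", accuracy)
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
print("balanced accuracy:", balanced_accuracy(tp, tn, fp, fn))
```

In this toy case the plain accuracy is 0.8 while the balanced accuracy is only 0.6875, because half of the positive samples are missed.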

Another metric is the predicted positive condition rate (PPCR), which identifies the percentage of the total population that is flagged as positive:

\begin{align} \text{Predicted positive condition rate (PPCR)} = \frac{T_p+F_p}{T_p+F_p+T_n+F_n} \end{align}
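
The PPCR formula translates directly into code. The function name and the counts in this sketch are illustrative assumptions.

```python
def ppcr(tp, tn, fp, fn):
    """Share of all samples that the model flags as positive."""
    return (tp + fp) / (tp + fp + tn + fn)


# Toy confusion counts: 2 of 10 samples are predicted positive.
value = ppcr(tp=1, tn=7, fp=1, fn=1)
print("PPCR:", value)
```

For a marketing campaign, this value can be read as the fraction of clients that would actually be called if only predicted subscribers were contacted.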

Predictions

Conclusions

Although the model is doing pretty well considering the complexity of this problem, we can improve the results by designing an iterative optimization that utilizes the accuracy and recall scores.


References

  1. S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

  2. S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS. [bank.zip]

  3. Scikit-learn Precision-Recall

  4. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC bioinformatics 6.1 (2005): 1-15.

  5. Precision and recall wikipedia page